========================================================

Introduction

This report looks at the effect of 11 variables on white wine quality in a data set of almost 4900 wines. It was produced as part of the Udacity Data Analyst Nanodegree program. The data is taken from:

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

The data set contains 13 columns.

## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000

The first column ‘X’ appears to be a row index and can be ignored in the analysis. That leaves the following independent variables for analysis:

  1. fixed acidity (g/L) - tartaric acid, non-volatile
  2. volatile acidity (g/L) - acetic acid, causes a vinegar taste
  3. citric acid (g/L) - makes wine taste “fresh”
  4. residual sugar (g/L) - sugar that remains after fermentation
  5. chlorides (g/L) - sodium chloride or salt
  6. free sulfur dioxide (mg/L) - prevents microbial growth and oxidation in wine
  7. total sulfur dioxide (mg/L) - free and bound sulfur dioxide
  8. density (g/mL) - average wine density is close to the density of water depending on alcohol and sugar levels
  9. pH - measure of acidity and basicity; most wines are acidic
  10. sulphates (g/L) - potassium sulphate, an antimicrobial and antioxidant additive
  11. alcohol (%) - percent of alcohol in a wine by volume

There is one dependent variable:

  1. quality - a wine taster’s assigned rating on a scale of 0 to 10, presumably influenced by the above independent variables.

Quality is a categorical value and will need to be converted to a factor for analysis.

wine$quality <- as.factor(wine$quality)

That leaves 11 variables to explore, each of which may influence the quality of a wine.
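As a minimal sketch of these preparation steps, the snippet below drops the index column and converts quality to a factor. It uses a tiny made-up data frame; the name `wine_demo` and its values are assumptions for illustration, not the actual data set.

```r
# Toy stand-in for the wine data; values are illustrative only.
wine_demo <- data.frame(
  X = 1:4,
  alcohol = c(8.8, 9.5, 10.1, 9.9),
  quality = c(6L, 5L, 6L, 7L)
)

# Drop the row-index column so it is not mistaken for a predictor.
wine_demo$X <- NULL

# Treat quality as a categorical variable.
wine_demo$quality <- as.factor(wine_demo$quality)

str(wine_demo)
```

If the ordering of the ratings matters for a later step, `factor(wine_demo$quality, ordered = TRUE)` would be one alternative to `as.factor()`.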

Univariate Plots Section

## 
##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5

Quality has a roughly bell-shaped distribution centered on the median rating of 6, which accounts for 2198 of the 4898 wines (about 45%). The best and worst wines are barely represented: only 5 wines are rated 9 and 20 wines are rated 3. Either there’s a bias in how the data was collected, leading to end values being excluded, or it’s hard for most wines to stand out as either great or terrible.
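To put the counts above on a common scale, the short sketch below converts them to percentages. The counts are copied from the table; the variable names are chosen here for illustration.

```r
# Rating counts copied from the table above.
quality_counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
                    "7" = 880, "8" = 175, "9" = 5)

# Share of wines at each rating, in percent.
quality_share <- round(100 * quality_counts / sum(quality_counts), 1)
print(quality_share)
```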

The three acidity measures roughly fit normal distributions with long tails to the right. A few wines in the data set have high levels of acetic (measured by volatile acidity) or citric acid.

A histogram of pH doesn’t show the long tail seen in the three acid plots, however. pH has a mostly normal distribution with a few values overrepresented.

Sugar and chlorides are skewed to the right: most values sit at the low end, with long rightward tails.

The three sulfur measurements show a similar right skew, but with most of the data forming a normal distribution and then a small number of samples stretching to the right. Sulphates has interesting gaps every 0.1 g/L.

Density shows the same right skew as the previous measurements.

Alcohol has an interesting pattern similar to sulphates, but with gaps in the data at more frequent intervals. Certain alcohol levels have either no or few samples. Maybe this is a result of how alcohol levels are measured and rounded to the nearest value.

A log transform gives the acidity measurements approximately normal distributions and reveals the same staccato pattern present in alcohol and sulphates.

Log transforming citric acid seems to reverse the skew and result in a long leftward tail instead of a rightward tail.

A log transform of residual sugar reveals a bimodal distribution.

Chlorides and free sulfur dioxide have their distributions pulled more towards the center by a log transform.

A log transform of density doesn’t change the distribution much. The long rightward tail is most likely caused by outliers rather than the distribution of the data.

wine$fixed.acid.log <- log10(wine$fixed.acidity)
wine$volatile.acid.log <- log10(wine$volatile.acidity)
wine$chlorides.log <- log10(wine$chlorides)
wine$free.sulfur.dioxide.log <- log10(wine$free.sulfur.dioxide)
wine$sugar.log <- log10(wine$residual.sugar)

These log-transformed variables were added to the data set because the transformation either gives the variable an approximately normal distribution or, in the case of residual sugar, reveals a bimodal distribution.
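The effect of a log transform on a long right tail can be checked numerically as well as visually. The sketch below is an illustration on simulated log-normal data, not the wine measurements; the `skewness` helper is defined here and is not part of base R.

```r
set.seed(42)

# Simulated right-skewed measurements (log-normal); not the wine data.
x <- rlnorm(5000, meanlog = 0, sdlog = 0.5)

# Moment-based sample skewness: positive values indicate a long right tail.
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

skew_raw <- skewness(x)
skew_log <- skewness(log10(x))
c(raw = skew_raw, log10 = skew_log)
```

A skewness near zero after the transform suggests the tail was a property of the distribution itself. A value that stays clearly positive, as described for density, points instead to a few extreme observations.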

Univariate Analysis

What is the structure of your dataset?

There are 4898 observations of 12 variables: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, and quality. Most of the variables are continuous except for quality which is categorical with levels from 3 (worst quality) to 9 (best quality) counting by whole numbers.

Most wines (about 93%) are of quality 5, 6, or 7. Many of the variables are right-skewed, with long rightward tails. Residual sugar has a bimodal distribution when log transformed. Alcohol measurements show a staccato pattern of many measurements followed by few or none.

What is/are the main feature(s) of interest in your dataset?

Quality is the dependent variable in the data set. The other variables presumably affect the rating that a wine taster gives. I’m curious which variables best correlate with quality.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

I would guess variables involving acidity, chlorides, and sulfur would affect the taste of wine and influence a taster’s rating.

Did you create any new variables from existing variables in the dataset?

I created log transformed variables of fixed acidity, volatile acidity, chlorides, free sulfur dioxide, and residual sugar based on histograms that demonstrated those variables took on normal distributions when log transformed (or in the case of sugar had a bimodal distribution). I also converted quality into a factor so R will treat it as a categorical variable.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

Fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, and density all showed a strong right skew (a long rightward tail) in the data. I performed log transformations on these variables, which resulted in fixed acidity, volatile acidity, chlorides, and free sulfur dioxide forming distributions closer to normal. Citric acid’s distribution inverted, obtaining a leftward tail as opposed to a rightward tail. Residual sugar appears to have a bimodal distribution when log transformed. The log transformation did not affect the density distribution, suggesting that the long rightward tail is the result of outliers and not the bulk of the data.

Bivariate Plots Section

There are high positive correlations between residual sugar and density, total sulfur dioxide and density, free sulfur dioxide and total sulfur dioxide, and residual sugar and total sulfur dioxide.

There are high negative correlations between alcohol and density, alcohol and residual sugar, alcohol and chlorides, alcohol and total sulfur dioxide, and pH and fixed acidity.

##         X fixed.acidity volatile.acidity citric.acid residual.sugar
## 1654 1654           7.9            0.330        0.28           31.6
## 1664 1664           7.9            0.330        0.28           31.6
## 2782 2782           7.8            0.965        0.60           65.8
##      chlorides free.sulfur.dioxide total.sulfur.dioxide density   pH
## 1654     0.053                  35                  176 1.01030 3.15
## 1664     0.053                  35                  176 1.01030 3.15
## 2782     0.074                   8                  160 1.03898 3.39
##      sulphates alcohol quality fixed.acid.log volatile.acid.log
## 1654      0.38     8.8       6      0.8976271       -0.48148606
## 1664      0.38     8.8       6      0.8976271       -0.48148606
## 2782      0.69    11.7       6      0.8920946       -0.01547269
##      chlorides.log free.sulfur.dioxide.log sugar.log
## 1654     -1.275724                1.544068  1.499687
## 1664     -1.275724                1.544068  1.499687
## 2782     -1.130768                0.903090  1.818226

Three observations have densities greater than 1.01. These are probably responsible for the long rightward tail on the density histogram that a log transformation could not correct. They could be influencing the high correlations observed above, so I’ll remove them from subsequent plots and analyses.
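A minimal sketch of this filtering step is shown below. The toy data frame and the names `wine_demo` and `wine_subset_demo` are assumptions for illustration; only the three high density values are taken from the rows printed above.

```r
# Toy data frame; the three high densities mirror the rows printed above,
# the remaining values are made up for illustration.
wine_demo <- data.frame(
  density = c(0.9940, 0.9917, 1.0103, 1.0103, 1.03898, 0.9961)
)

# Keep only observations at or below the 1.01 cut-off.
wine_subset_demo <- subset(wine_demo, density <= 1.01)

# Number of rows dropped by the filter.
nrow(wine_demo) - nrow(wine_subset_demo)
```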

## [1] 0.8320888

There’s a positive correlation between residual sugar and density, which makes sense as more sugar would make a wine denser.

A scatterplot using the log transformed residual sugar variable reveals the bimodal distribution. Low sugar wines have a homogeneous dispersal with regard to density while high sugar wines have a positive linear relationship with density. Other variables may be influencing density in low sugar wines, resulting in a lack of a trend, while in high sugar wines, sugar is the main factor influencing density.

## [1] 0.7665352

The correlation between density and residual sugar drops by about 0.07 (from 0.83 to 0.77) when the bimodal distribution is taken into account. A single linear correlation is clearly not the best model for the relationship between these two variables.

## [1] -0.8041518

A negative correlation is seen between alcohol and density, which again makes sense because sugar is converted into alcohol during fermentation and ethanol is less dense than water. If sugar directly contributes to density, then density will decrease as the sugar is consumed by yeast.

## [1] -0.4591654

I would expect alcohol and residual sugar to correlate since sugar is converted into alcohol during fermentation, and each variable has a correlation with density of about 0.8 in magnitude. The two variables are negatively correlated (about -0.46), but much more weakly than either is with density. The bimodal distribution in sugar complicates the comparison.

## [1] -0.3937291

Using the log transformed sugar variable reduces the strength of the correlation.

## [1] 0.2891867

Citric acid has a correlation of about 0.29 with fixed acidity. Squaring the coefficient gives roughly 0.08, so a linear relationship with citric acid accounts for only about 8% of the variance in fixed acidity. Nothing else correlates as highly with fixed acidity on the correlation matrix, however.
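When reading a correlation coefficient as "variance explained", it is the squared coefficient that applies. The short calculation below uses the value reported above; the variable names are chosen here for illustration.

```r
r <- 0.2891867      # correlation between citric acid and fixed acidity, from above
r_squared <- r^2    # share of variance accounted for by a linear fit

# Expressed as a percentage, rounded to one decimal place.
round(100 * r_squared, 1)
```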

## [1] 0.2925708

While log transforming fixed acidity improves the distribution of the fixed acidity histogram, the transformation does not contribute much to the correlation with citric acid.

## [1] 0.5424823

## [1] 0.2602505

In addition to residual sugar, total sulfur dioxide and chlorides also contribute to the density of a wine.

## [1] 0.400329

The log transformed chlorides variable shows a higher correlation than the base variable.

## [1] -0.5012869

There’s a negative trend where the more chlorides a wine has, the less alcohol it has as well.

## [1] 0.1469724

There’s no obvious relationship between chlorides and sugar.

## [1] -0.4487925

As with chlorides, there’s a negative trend where the more total sulfur dioxide a wine has, the less alcohol it has.

## [1] 0.4182469

The more sugar a wine has, the more total sulfur dioxide it also has.

## wine.subset$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.55   10.45   10.35   11.00   12.60 
## -------------------------------------------------------- 
## wine.subset$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.40   10.10   10.15   10.75   13.50 
## -------------------------------------------------------- 
## wine.subset$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.000   9.200   9.500   9.809  10.300  13.600 
## -------------------------------------------------------- 
## wine.subset$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.50    9.60   10.50   10.58   11.40   14.00 
## -------------------------------------------------------- 
## wine.subset$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.60   10.60   11.40   11.37   12.30   14.20 
## -------------------------------------------------------- 
## wine.subset$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.50   11.00   12.00   11.64   12.60   14.00 
## -------------------------------------------------------- 
## wine.subset$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   12.40   12.50   12.18   12.70   12.90

Higher alcohol wines tend to receive higher ratings: mean alcohol rises steadily from about 9.8% at quality 5 to about 12.2% at quality 9. The pattern is reversed across the three lowest ratings, where mean alcohol falls from 10.35% at quality 3 to about 9.8% at quality 5. Other factors may affect ratings at the low end, with alcohol becoming a driver of quality from rating 5 upward.
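The change of direction around quality 5 can be made explicit by differencing the group means. The sketch below copies the mean alcohol values from the summaries above; the variable names are assumptions for illustration.

```r
# Mean alcohol (%) by quality rating, copied from the summaries above.
mean_alcohol <- c("3" = 10.35, "4" = 10.15, "5" = 9.809, "6" = 10.58,
                  "7" = 11.37, "8" = 11.64, "9" = 12.18)

# Change in mean alcohol from each rating to the next.
step_change <- diff(mean_alcohol)
print(round(step_change, 2))

# Rating at which mean alcohol is lowest.
lowest_rating <- names(which.min(mean_alcohol))
print(lowest_rating)
```

Negative steps followed by positive ones correspond to the reversal described in the text. Group sizes differ greatly (5 wines at rating 9 versus 2198 at rating 6), so the means at the extremes are less stable than those in the middle.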

## wine.subset$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.700   1.587   4.600   6.393  10.700  16.200 
## -------------------------------------------------------- 
## wine.subset$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.700   1.300   2.500   4.628   7.100  17.550 
## -------------------------------------------------------- 
## wine.subset$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.800   7.000   7.335  11.500  23.500 
## -------------------------------------------------------- 
## wine.subset$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.700   1.700   5.300   6.392   9.900  26.050 
## -------------------------------------------------------- 
## wine.subset$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.700   3.650   5.186   7.325  19.250 
## -------------------------------------------------------- 
## wine.subset$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.800   2.100   4.300   5.671   8.200  14.800 
## -------------------------------------------------------- 
## wine.subset$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.60    2.00    2.20    4.12    4.20   10.60

## wine.subset$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.1549  0.2007  0.6628  0.6233  1.0293  1.2095 
## -------------------------------------------------------- 
## wine.subset$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.1549  0.1139  0.3979  0.4872  0.8513  1.2443 
## -------------------------------------------------------- 
## wine.subset$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.2218  0.2553  0.8451  0.7036  1.0607  1.3711 
## -------------------------------------------------------- 
## wine.subset$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.1549  0.2304  0.7243  0.6457  0.9956  1.4158 
## -------------------------------------------------------- 
## wine.subset$quality: 7
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.04576  0.23045  0.56225  0.56820  0.86480  1.28443 
## -------------------------------------------------------- 
## wine.subset$quality: 8
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.09691  0.32222  0.63347  0.62019  0.91378  1.17026 
## -------------------------------------------------------- 
## wine.subset$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2041  0.3010  0.3424  0.4992  0.6232  1.0253

The ranges for residual sugar overlap with each other at each quality level, and plotting the log of residual sugar doesn’t reveal any new patterns.

## subset(wine.subset, sugar.log < 0.5)$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.700   1.188   1.475   1.562   1.700   2.900 
## -------------------------------------------------------- 
## subset(wine.subset, sugar.log < 0.5)$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.700   1.125   1.350   1.473   1.700   3.000 
## -------------------------------------------------------- 
## subset(wine.subset, sugar.log < 0.5)$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.200   1.400   1.533   1.800   3.150 
## -------------------------------------------------------- 
## subset(wine.subset, sugar.log < 0.5)$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.700   1.200   1.500   1.644   1.900   3.100 
## -------------------------------------------------------- 
## subset(wine.subset, sugar.log < 0.5)$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.400   1.600   1.783   2.200   3.100 
## -------------------------------------------------------- 
## subset(wine.subset, sugar.log < 0.5)$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.800   1.400   1.800   1.801   2.125   2.900 
## -------------------------------------------------------- 
## subset(wine.subset, sugar.log < 0.5)$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.600   1.800   2.000   1.933   2.100   2.200

## subset(wine.subset, sugar.log > 0.5)$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.500   4.975  10.050   9.613  12.100  16.200 
## -------------------------------------------------------- 
## subset(wine.subset, sugar.log > 0.5)$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.200   5.100   7.400   8.153  10.600  17.550 
## -------------------------------------------------------- 
## subset(wine.subset, sugar.log > 0.5)$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.20    7.00    9.35   10.21   13.10   23.50 
## -------------------------------------------------------- 
## subset(wine.subset, sugar.log > 0.5)$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.200   6.200   8.500   9.369  12.400  26.050 
## -------------------------------------------------------- 
## subset(wine.subset, sugar.log > 0.5)$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.200   4.800   6.900   8.142  10.900  19.250 
## -------------------------------------------------------- 
## subset(wine.subset, sugar.log > 0.5)$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.200   4.700   7.100   8.131  10.900  14.800 
## -------------------------------------------------------- 
## subset(wine.subset, sugar.log > 0.5)$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     4.2     5.8     7.4     7.4     9.0    10.6

The bimodal distribution of sugar is probably concealing any patterns in the boxplots. I split the data at sugar.log = 0.5 (a residual sugar of about 3.2 g/L), as that is where the two groups separate on a scatterplot. In the low sugar group (sugar.log < 0.5), mean residual sugar generally increases with quality from rating 4 upward. There doesn’t appear to be a pattern in the high sugar group (sugar.log > 0.5).
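Because the split is made on the log10 scale (the summaries above condition on `sugar.log < 0.5` and `sugar.log > 0.5`), the cut-off in original units is found by back-transforming. The sketch below shows that conversion and applies the split to a few made-up sugar values; `sugar` and `sugar_group` are hypothetical names.

```r
log_cutoff <- 0.5
cutoff_g_per_l <- 10^log_cutoff   # back-transform from log10 to g/L
print(round(cutoff_g_per_l, 2))

# Toy residual sugar values (g/L), illustrative only.
sugar <- c(1.2, 2.9, 3.15, 3.2, 8.5, 20.7)

# Assign each wine to the low or high sugar group.
sugar_group <- ifelse(log10(sugar) < log_cutoff, "low", "high")
table(sugar_group)
```

This is consistent with the summaries above, where the low sugar group tops out at 3.15 g/L and the high sugar group starts at 3.2 g/L.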

## wine.subset$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.200   6.575   7.300   7.600   8.525  11.800 
## -------------------------------------------------------- 
## wine.subset$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.800   6.400   6.900   7.129   7.600  10.200 
## -------------------------------------------------------- 
## wine.subset$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.500   6.400   6.800   6.934   7.400  10.300 
## -------------------------------------------------------- 
## wine.subset$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.836   7.300  14.200 
## -------------------------------------------------------- 
## wine.subset$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.200   6.200   6.700   6.735   7.200   9.200 
## -------------------------------------------------------- 
## wine.subset$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.900   6.200   6.800   6.657   7.300   8.200 
## -------------------------------------------------------- 
## wine.subset$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.60    6.90    7.10    7.42    7.40    9.10

## wine.subset$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.6232  0.8177  0.8632  0.8702  0.9307  1.0719 
## -------------------------------------------------------- 
## wine.subset$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.6812  0.8062  0.8388  0.8483  0.8808  1.0086 
## -------------------------------------------------------- 
## wine.subset$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.6532  0.8062  0.8325  0.8379  0.8692  1.0128 
## -------------------------------------------------------- 
## wine.subset$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.5798  0.7993  0.8325  0.8316  0.8633  1.1523 
## -------------------------------------------------------- 
## wine.subset$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.6232  0.7924  0.8261  0.8256  0.8573  0.9638 
## -------------------------------------------------------- 
## wine.subset$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.5911  0.7924  0.8325  0.8198  0.8633  0.9138 
## -------------------------------------------------------- 
## wine.subset$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.8195  0.8388  0.8513  0.8676  0.8692  0.9590

## wine.subset$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1700  0.2375  0.2600  0.3332  0.4125  0.6400 
## -------------------------------------------------------- 
## wine.subset$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1100  0.2700  0.3200  0.3812  0.4600  1.1000 
## -------------------------------------------------------- 
## wine.subset$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.100   0.240   0.280   0.302   0.340   0.905 
## -------------------------------------------------------- 
## wine.subset$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2000  0.2500  0.2602  0.3000  0.7850 
## -------------------------------------------------------- 
## wine.subset$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.1900  0.2500  0.2628  0.3200  0.7600 
## -------------------------------------------------------- 
## wine.subset$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.2000  0.2600  0.2774  0.3300  0.6600 
## -------------------------------------------------------- 
## wine.subset$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.240   0.260   0.270   0.298   0.360   0.360

## wine.subset$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.7696 -0.6244 -0.5850 -0.5100 -0.3864 -0.1938 
## -------------------------------------------------------- 
## wine.subset$quality: 4
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.95861 -0.56864 -0.49485 -0.45775 -0.33734  0.04139 
## -------------------------------------------------------- 
## wine.subset$quality: 5
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -1.00000 -0.61979 -0.55284 -0.54145 -0.46852 -0.04335 
## -------------------------------------------------------- 
## wine.subset$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.0969 -0.6990 -0.6021 -0.6067 -0.5229 -0.1051 
## -------------------------------------------------------- 
## wine.subset$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.0969 -0.7212 -0.6021 -0.6061 -0.4949 -0.1192 
## -------------------------------------------------------- 
## wine.subset$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.9208 -0.6990 -0.5850 -0.5878 -0.4815 -0.1805 
## -------------------------------------------------------- 
## wine.subset$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.6198 -0.5850 -0.5686 -0.5322 -0.4437 -0.4437

## wine.subset$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2100  0.2575  0.3450  0.3360  0.3850  0.4700 
## -------------------------------------------------------- 
## wine.subset$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1900  0.2900  0.3042  0.4000  0.8800 
## -------------------------------------------------------- 
## wine.subset$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2400  0.3200  0.3377  0.4100  1.0000 
## -------------------------------------------------------- 
## wine.subset$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.270   0.320   0.338   0.380   1.660 
## -------------------------------------------------------- 
## wine.subset$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0100  0.2800  0.3100  0.3256  0.3600  0.7400 
## -------------------------------------------------------- 
## wine.subset$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0400  0.2800  0.3200  0.3265  0.3600  0.7400 
## -------------------------------------------------------- 
## wine.subset$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.290   0.340   0.360   0.386   0.450   0.490

The three acidity measures have no apparent relationship to quality.

## wine.subset$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.02200 0.03625 0.04100 0.05430 0.05400 0.24400 
## -------------------------------------------------------- 
## wine.subset$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0130  0.0380  0.0460  0.0501  0.0540  0.2900 
## -------------------------------------------------------- 
## wine.subset$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.04000 0.04700 0.05155 0.05300 0.34600 
## -------------------------------------------------------- 
## wine.subset$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0150  0.0360  0.0430  0.0452  0.0490  0.2550 
## -------------------------------------------------------- 
## wine.subset$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.03100 0.03700 0.03819 0.04400 0.13500 
## -------------------------------------------------------- 
## wine.subset$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01400 0.03000 0.03600 0.03831 0.04400 0.12100 
## -------------------------------------------------------- 
## wine.subset$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0180  0.0210  0.0310  0.0274  0.0320  0.0350

## wine.subset$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.6576 -1.4410 -1.3872 -1.3336 -1.2678 -0.6126 
## -------------------------------------------------------- 
## wine.subset$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.8861 -1.4202 -1.3372 -1.3318 -1.2676 -0.5376 
## -------------------------------------------------------- 
## wine.subset$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -2.0458 -1.3979 -1.3279 -1.3187 -1.2757 -0.4609 
## -------------------------------------------------------- 
## wine.subset$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.8239 -1.4437 -1.3665 -1.3709 -1.3098 -0.5935 
## -------------------------------------------------------- 
## wine.subset$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.9208 -1.5086 -1.4318 -1.4335 -1.3565 -0.8697 
## -------------------------------------------------------- 
## wine.subset$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.8539 -1.5229 -1.4437 -1.4364 -1.3565 -0.9172 
## -------------------------------------------------------- 
## wine.subset$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -1.745  -1.678  -1.509  -1.576  -1.495  -1.456

The mean chloride level trends downward with increasing quality. The amount of salt in a wine could directly influence a taster’s rating, or this pattern could just result from chlorides’ positive correlation with density and alcohol’s negative correlation with density.

## wine.subset$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00   13.25   33.50   53.33   47.50  289.00 
## -------------------------------------------------------- 
## wine.subset$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    9.00   18.00   23.36   30.50  138.50 
## -------------------------------------------------------- 
## wine.subset$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   22.00   35.00   36.43   50.00  131.00 
## -------------------------------------------------------- 
## wine.subset$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00   24.00   34.00   35.66   46.00  112.00 
## -------------------------------------------------------- 
## wine.subset$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00   25.00   33.00   34.13   41.00  108.00 
## -------------------------------------------------------- 
## wine.subset$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   28.00   35.00   36.72   44.50  105.00 
## -------------------------------------------------------- 
## wine.subset$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    24.0    27.0    28.0    33.4    31.0    57.0

## wine.subset$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    19.0   105.8   159.5   170.6   210.0   440.0 
## -------------------------------------------------------- 
## wine.subset$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    10.0    85.0   117.0   125.3   171.5   272.0 
## -------------------------------------------------------- 
## wine.subset$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   121.0   151.0   150.9   182.0   344.0 
## -------------------------------------------------------- 
## wine.subset$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      18     107     132     137     164     294 
## -------------------------------------------------------- 
## wine.subset$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    34.0   101.0   122.0   125.1   144.2   229.0 
## -------------------------------------------------------- 
## wine.subset$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    59.0   102.5   122.0   126.2   150.0   212.5 
## -------------------------------------------------------- 
## wine.subset$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      85     113     119     116     124     139

## wine.subset$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2800  0.3800  0.4400  0.4745  0.5425  0.7400 
## -------------------------------------------------------- 
## wine.subset$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2500  0.3800  0.4700  0.4761  0.5400  0.8700 
## -------------------------------------------------------- 
## wine.subset$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2700  0.4200  0.4700  0.4822  0.5300  0.8800 
## -------------------------------------------------------- 
## wine.subset$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2300  0.4100  0.4800  0.4911  0.5500  1.0600 
## -------------------------------------------------------- 
## wine.subset$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4100  0.4800  0.5031  0.5800  1.0800 
## -------------------------------------------------------- 
## wine.subset$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2500  0.3800  0.4600  0.4862  0.5850  0.9500 
## -------------------------------------------------------- 
## wine.subset$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.360   0.420   0.460   0.466   0.480   0.610

Amounts of sulfur dioxide and sulphate have no apparent relationship with quality.

## wine.subset$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.870   3.035   3.215   3.188   3.325   3.550 
## -------------------------------------------------------- 
## wine.subset$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.830   3.070   3.160   3.183   3.280   3.720 
## -------------------------------------------------------- 
## wine.subset$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.790   3.080   3.160   3.169   3.240   3.790 
## -------------------------------------------------------- 
## wine.subset$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.080   3.180   3.189   3.280   3.810 
## -------------------------------------------------------- 
## wine.subset$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.840   3.100   3.200   3.214   3.320   3.820 
## -------------------------------------------------------- 
## wine.subset$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.940   3.120   3.230   3.219   3.330   3.590 
## -------------------------------------------------------- 
## wine.subset$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.200   3.280   3.280   3.308   3.370   3.410

Quality may increase slightly with pH. However, rank 9 contains so few wines that its mean is questionable when compared with the other ranks.

## wine.subset$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9911  0.9925  0.9944  0.9949  0.9969  1.0001 
## -------------------------------------------------------- 
## wine.subset$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9892  0.9926  0.9941  0.9943  0.9958  1.0004 
## -------------------------------------------------------- 
## wine.subset$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9872  0.9933  0.9953  0.9953  0.9972  1.0024 
## -------------------------------------------------------- 
## wine.subset$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9876  0.9917  0.9937  0.9939  0.9959  1.0030 
## -------------------------------------------------------- 
## wine.subset$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9906  0.9918  0.9925  0.9937  1.0004 
## -------------------------------------------------------- 
## wine.subset$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9903  0.9916  0.9922  0.9935  1.0006 
## -------------------------------------------------------- 
## wine.subset$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9897  0.9898  0.9903  0.9915  0.9906  0.9970

The boxplots for density roughly mirror those for alcohol, with density decreasing from quality ranks 5 to 9. There is an increase in density from rank 4 to 5, matching the decrease in alcohol from 4 to 5 on those boxplots. This should all be expected due to the negative correlation between alcohol and density.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

Some of the features I thought would correlate with quality (acidity and sulfur) did not. Chlorides did show a trend, with the mean amount of salt in a wine decreasing as quality increases. The strongest pattern, though, came from alcohol, which appears to be the main driver of a taster’s rating. The more alcohol a wine has, the higher its rating.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

Residual sugar, total sulfur dioxide, and chlorides positively correlate with density, as expected, since adding solutes increases the density of a solution. Alcohol negatively correlates with density, again as expected, since sugar is converted into alcohol during fermentation.
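As a rough sketch of how these signs can be checked, the correlation of each variable with density can be read from one column of `cor()`. The data frame `wine.demo` below is a made-up example with only three columns. It is not the wine data set, so the magnitudes are meaningless and only the signs are of interest.

```r
# Hypothetical toy data: sweeter wines are denser, stronger wines are lighter.
wine.demo <- data.frame(
  residual.sugar = c(1.5, 2.0, 5.0, 8.0, 12.0, 18.0),
  alcohol        = c(12.5, 11.8, 11.0, 10.2, 9.5, 9.0),
  density        = c(0.9905, 0.9912, 0.9935, 0.9958, 0.9975, 1.0002)
)

# Pearson correlation of every column with density.
density.cor <- cor(wine.demo)[, "density"]
print(round(density.cor, 2))
```

On the real data the same call could be applied to the numeric columns of the wine data frame, assuming the column names shown in the structure listing at the top of the report.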

What was the strongest relationship you found?

Quality ranks 5 through 9 each show increasing levels of alcohol. Alcohol appears to be the main predictor of a wine taster’s rating.

Multivariate Plots Section

Higher quality wines have a lower density regardless of sugar level.

High quality wines concentrate in the high alcohol, low density levels.

Both low and high sugar wines have higher quality ratings in the higher alcohol ranges.

As quality level increases, the majority of wines in each rank shift to the right toward higher alcohol levels. There’s a distinct split between high and low sugar wines, but these groups don’t shift with quality the way alcohol does.

High quality wines have low amounts of chlorides, low densities, and high alcohol levels.

The split in the residual sugar observations happens around 0.5 g/L. Dividing the alcohol by chlorides plots along this line doesn’t reveal any new details.

High quality wines appear to have less total sulfur dioxide than low quality wines.

High quality wines have low density and low total sulfur dioxide levels.

High quality wines have high alcohol and low total sulfur dioxide levels.

Higher quality wines cluster in the lower left-hand corner, where both chlorides and total sulfur dioxide are low.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

The highest quality wines tend to have high levels of alcohol, low densities, and low chloride and total sulfur dioxide levels independent of sugar level. Sugar, chloride, and sulfur dioxide all contribute to density. Since high quality wines tend to have low densities independent of sugar, density is probably a proxy for chloride and sulfur dioxide levels.

Were there any interesting or surprising interactions between features?

High quality wines tend to have a low density even when residual sugar levels are high. Since a scatterplot of residual sugar versus density showed a high correlation between those two variables, other variables that increase density, such as chlorides and sulfur dioxide, must be lower in high sugar wines of high quality. Plots of residual sugar versus chlorides and residual sugar versus total sulfur dioxide with points colored for quality showed this, with high sugar, high quality wines being on the low end of the chloride and total sulfur dioxide scales.


Final Plots and Summary

Plot One

Description One

The median percentage of alcohol in a wine increases with quality across the middle three grades, where most of the observations are. Quality grades 5, 6, and 7 have 1457, 2198, and 880 observations respectively. There is a decrease in percentage of alcohol from grades 3 to 5; however, grade 3 has only 20 observations and grade 4 has 163. With so few observations compared to the middle three grades, these groups could be skewed by an overrepresentation of higher alcohol wines. Likewise, even though grades 8 and 9 fit the pattern of increasing alcohol with increasing quality, the number of observations in each (175 and 5 respectively) means conclusions from those grades should be viewed with caution.
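To put the group sizes in perspective, the sketch below computes the share of wines in the middle three grades and an approximate standard error for each grade's mean. The counts are the ones given in the description. The standard deviation `assumed.sd` is a hypothetical value chosen for illustration, and the standard error of the mean is used here as a simple stand-in for the uncertainty of the medians shown in the plot.

```r
# Observations per quality grade, as reported in the description above.
n.per.grade <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
                 "7" = 880, "8" = 175, "9" = 5)

# Share of all wines that fall in the middle three grades.
middle.share <- sum(n.per.grade[c("5", "6", "7")]) / sum(n.per.grade)

# Assumed within-grade standard deviation of alcohol, in percent.
# Hypothetical value for illustration, not computed from the data.
assumed.sd <- 1.2

# Standard error of a grade mean: sd / sqrt(n).
se.mean <- assumed.sd / sqrt(n.per.grade)

print(round(middle.share, 3))
print(round(se.mean, 3))
```

Under the equal-spread assumption, the standard error for grade 9 is more than ten times that of grade 6, which is why the extreme grades deserve caution.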

Plot Two

Description Two

There is a bimodal distribution for residual sugar in the white wine data set. This probably reflects the distinction between sweet and dry wines. High alcohol wines tend to be ranked as high quality by wine tasters regardless of sugar level. In the upper left-hand corner, an island of high quality wines can be seen that are close to the lowest alcohol levels. These are also some of the sweetest wines. It appears that the one exception to high ratings going to high alcohol wines is this group of low alcohol, high sugar wines.

Plot Three

Description Three

The highest quality wines also have some of the lowest salt (sodium chloride) and total sulfur dioxide levels. Wine tasters primarily enjoy alcohol while disliking flavors introduced by salt and sulfur dioxide.


Reflection

The white wine data set contains about 4900 wines and 12 variables. I began with a series of histograms on each variable to get a sense for the shape of the distributions. Many variables were skewed to the right, with long rightward tails. Log transformations gave these variables approximately normal distributions, except for density, where only three observations formed the long rightward tail. I considered these points outliers and removed them from subsequent analyses because there were so few of them. The one exception to the normal distributions was residual sugar, which had a bimodal distribution. This is most likely due to wines traditionally being either dry or sweet. The most interesting feature of the histograms was the gaps in the data for variables such as alcohol or volatile acidity once they were log transformed. The gaps seem most common on the left side of the histograms, in the lower range of the relevant unit. Perhaps this is an artifact of measurement processes that round to a nearest value.
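The effect of a log transformation on a long rightward tail can be illustrated with a small sketch. The vector `x` is made-up data, and `skewness.moment` is a hypothetical helper defined here (base R has no built-in skewness function). It computes the moment-based coefficient, which is positive for a long right tail and near zero for a symmetric distribution.

```r
# Moment-based skewness: third central moment over variance^(3/2).
# Hypothetical helper written for this example.
skewness.moment <- function(v) {
  centered <- v - mean(v)
  mean(centered^3) / mean(centered^2)^1.5
}

# Made-up values with a long rightward tail.
x <- c(1, 1, 2, 2, 2, 3, 3, 4, 5, 8, 15, 40)

skew.raw <- skewness.moment(x)
skew.log <- skewness.moment(log10(x))

print(c(raw = skew.raw, log10 = skew.log))
```

For this vector the skewness drops substantially after the transformation but stays positive, so a log transform reduces a right tail without guaranteeing a symmetric result.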

I was most surprised at how clearly alcohol percentage predicted wine quality in a boxplot, while every other variable showed little to no difference between the interquartile ranges of the different quality ranks. Wine tasters appear to be biased by alcohol over other factors contributing to flavor. However, the first few ranks contradict this, with alcohol percentage decreasing as quality increases from rank 3 to rank 5. This might be the result of far fewer observations in ranks 3 and 4 (20 and 163 observations respectively) compared to rank 5 (1457 wines). A small number of observations is more likely to be biased by extreme values. For this reason, conclusions about ranks 8 and 9 (175 and 5 observations respectively) should also be viewed with caution.

It was also interesting to see the relationship of several variables with density. Sugar, sodium chloride, and total sulfur dioxide all positively correlated with density while alcohol negatively correlated. The fact that solutes contribute to the density of a solution and sugar is converted into alcohol during fermentation is visible in the data set.

The largest issues I had were figuring out how to handle the bimodal distribution of residual sugar and investigating the other variables influencing density. I split the data set in half for two analyses involving residual sugar, looking at boxplots of sugar versus quality for wines with < 0.5 g/L of residual sugar and > 0.5 g/L, as well as scatterplots of alcohol versus chloride content split by sugar level. There’s a pattern of dry wines receiving a higher rating with higher sugar content (so the wine tasters don’t want their dry wines too dry?). No pattern was observed for sweet wines, but they have so much sugar to begin with that differences in residual sugar levels may not be enough to affect quality scores. It makes sense that level of sugar would have a larger impact in dry wines.

Chlorides and sulfur dioxide are trickier to interpret. High sugar wines receive a higher score when they have a lower density. Chlorides and sulfur dioxide contribute to density, implying that higher chloride and sulfur dioxide levels reduce a wine’s score. However, while high chloride and sulfur dioxide wines have lower scores, they also have less alcohol.

I would like to examine the companion red wine data set to test how it compares. More observations for the lowest and highest quality wines would also be extremely helpful to test if the pattern of rising quality with rising alcohol percentage holds. The red wine data set has fewer observations than this one, so combining the two wouldn’t necessarily address this issue.